AITopics | product attention



Coneheads: Hierarchy Aware Attention

Neural Information Processing Systems | Oct-9-2025, 03:14:15 GMT

These networks rely heavily on the dot product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real world datasets, such as hierarchies between data points.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)


Fast and Simplex: 2-Simplicial Attention in Triton

Roy, Aurko, Chou, Timothy, Duvvuri, Sai Surya, Chen, Sijia, Yu, Jiecao, Wang, Xiaodong, Zaheer, Manzil, Anil, Rohan

arXiv.org Artificial Intelligence | Jul-4-2025

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot product attention.
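To make "generalizes standard dot-product attention to trilinear functions" concrete, here is a minimal pure-Python sketch of one plausible trilinear attention form. It is an illustration under stated assumptions, not the paper's Triton kernel: the function name `trilinear_attention`, the use of two key sets and two value sets, the joint softmax over key pairs, and the elementwise product of values are all assumptions, and any scaling or windowing the authors may use is omitted.

```python
import math


def trilinear_attention(queries, keys1, keys2, values1, values2):
    """Illustrative trilinear attention (hypothetical helper, not the paper's kernel).

    Assumed form: logit[j][k] = sum_c q[c] * keys1[j][c] * keys2[k][c],
    softmax taken jointly over all (j, k) pairs, and each pair contributes
    the elementwise product values1[j] * values2[k].
    """
    d = len(queries[0])
    dv = len(values1[0])
    outputs = []
    for q in queries:
        # Trilinear logits: one score per pair of key positions (j, k).
        logits = [
            [sum(q[c] * k1[c] * k2[c] for c in range(d)) for k2 in keys2]
            for k1 in keys1
        ]
        # Numerically stable softmax over the flattened (j, k) grid.
        m = max(max(row) for row in logits)
        exps = [[math.exp(x - m) for x in row] for row in logits]
        total = sum(sum(row) for row in exps)
        out = [0.0] * dv
        for j, row in enumerate(exps):
            for k, e in enumerate(row):
                w = e / total
                for c in range(dv):
                    out[c] += w * values1[j][c] * values2[k][c]
        outputs.append(out)
    return outputs
```

Written this way, each query scores every pair of positions, so the naive cost grows cubically with sequence length, which is presumably why an efficient kernel matters.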

dtype, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2507.02754

Country: North America > United States > California (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Coneheads: Hierarchy Aware Attention

Tseng, Albert, Yu, Tao, Liu, Toni J. B., De Sa, Christopher

arXiv.org Artificial Intelligence | Dec-3-2023

Attention networks such as transformers have achieved state-of-the-art performance in many domains. These networks rely heavily on the dot product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real world datasets, such as hierarchies between data points. To remedy this, we introduce cone attention, a drop-in replacement for dot product attention based on hyperbolic entailment cones. Cone attention associates two points by the depth of their lowest common ancestor in a hierarchy defined by hyperbolic cones, which intuitively measures the divergence of two points and gives a hierarchy aware similarity score. We test cone attention on a wide variety of models and tasks and show that it improves task-level performance over dot product attention and other baselines, and is able to match dot-product attention with significantly fewer parameters. Our results suggest that cone attention is an effective way to capture hierarchical relationships when calculating attention.
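For reference, the baseline operator the abstract describes (similarity as an inner product) can be sketched in a few lines of pure Python. This shows only standard scaled dot product attention, not cone attention; the function names are illustrative, and the 1/sqrt(d) scaling is the usual Transformer convention rather than something stated in the abstract.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dot_product_attention(queries, keys, values):
    """Scaled dot product attention for lists of vectors (illustrative sketch)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity between the query and each key is their inner product.
        weights = softmax([dot(q, k) * scale for k in keys])
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

A drop-in replacement in the sense of the abstract would swap the `dot(q, k)` score for a different similarity function while leaving the softmax and weighted average unchanged.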

cone, cone attention, product attention, (13 more...)

arXiv.org Artificial Intelligence

2306.00392

Country:

North America > United States > California (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)


Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs

Liao, Yi-Lun, Smidt, Tess

arXiv.org Artificial Intelligence | Feb-27-2023

Despite their widespread success in various domains, Transformer networks have yet to perform well across datasets in the domain of 3D atomistic graphs such as molecules even when 3D-related inductive biases like translational invariance and rotational equivariance are considered. In this paper, we demonstrate that Transformers can generalize well to 3D atomistic graphs and present Equiformer, a graph neural network leveraging the strength of Transformer architectures and incorporating SE(3)/E(3)-equivariant features based on irreducible representations (irreps). First, we propose a simple and effective architecture by only replacing original operations in Transformers with their equivariant counterparts and including tensor products. Using equivariant operations enables encoding equivariant information in channels of irreps features without complicating graph structures. With minimal modifications to Transformers, this architecture has already achieved strong empirical results. Second, we propose a novel attention mechanism called equivariant graph attention, which improves upon typical attention in Transformers through replacing dot product attention with multi-layer perceptron attention and including non-linear message passing. With these two innovations, Equiformer achieves competitive results to previous models on QM9, MD17 and OC20 datasets.
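The contrast between dot product attention and multi-layer perceptron attention can be illustrated with a generic, non-equivariant sketch: instead of an inner product, a small MLP scores each query-key pair. This is not Equiformer's equivariant graph attention; the function name, the single hidden layer, the tanh activation, and the weight shapes are all assumptions chosen for brevity.

```python
import math


def mlp_attention_weights(query, keys, w_hidden, w_out):
    """Attention weights from a one-hidden-layer MLP score (hypothetical helper).

    query:    list of floats, length d
    keys:     list of key vectors, each of length d
    w_hidden: hidden-layer weights, one row of length 2*d per hidden unit
    w_out:    output weights, one float per hidden unit
    """
    scores = []
    for k in keys:
        x = query + k  # concatenate query and key features
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
        scores.append(sum(w * h for w, h in zip(w_out, hidden)))
    # Numerically stable softmax over the per-key scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the score is a learned nonlinear function of the pair rather than a fixed bilinear form, it can express interactions that an inner product cannot, at the cost of extra parameters.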

artificial intelligence, machine learning, vector, (19 more...)

arXiv.org Artificial Intelligence

2206.11990

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report (0.50)

Industry: Materials > Chemicals (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)


Understanding Attention for Vision-and-Language Tasks

Cao, Feiqi, Han, Soyeon Caren, Long, Siqu, Xu, Changwei, Poon, Josiah

arXiv.org Artificial Intelligence | Sep-22-2022

Attention mechanisms have been used as an important component across Vision-and-Language (VL) tasks to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into the attention score calculation methods and checking how they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which an attention score calculation mechanism would be more (or less) interpretable, and which may impact model performance on three different VL tasks: visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, which is commonly ignored in attention-based cross-modal models and/or pretrained models. Our code is available at: https://github.com/adlnlp/Attention_VL

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2208.08104

Country:

Oceania > Australia > Western Australia (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Illinois (0.04)
(3 more...)

Genre:

Workflow (0.46)
Research Report (0.40)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


Image Captioning with an End to End Transformer Network.

#artificialintelligence | Jun-9-2022, 08:54:33 GMT

Transformer Networks are deep learning models that learn context and meaning in sequential data by tracking the relationships between the elements of a sequence. Since the introduction of Transformer Networks in 2017 by Google Brain in their revolutionary paper "Attention Is All You Need", transformers have been outperforming conventional neural networks in various problem domains, like Neural Machine Translation, Text Summarization, Language Understanding, and other Natural Language Processing tasks. They have also proved to be quite effective in Computer Vision tasks like Image Classification with Vision Transformers, as well as in Generative Networks. In this article, I will try to elaborate on my understanding of the attention mechanism through vision transformers and on sequence-to-sequence tasks through Transformer Networks. For problems in the Image Domain, like Image Classification and feature extraction from Images, Deep Convolutional Neural Network architectures like ResNet and Inception are used.

product attention, sequence, transformer network, (15 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


ML Collective's ICML Paper: A Probabilistic Interpretation of Transformers

#artificialintelligence | May-20-2022, 07:07:56 GMT

Since their introduction in 2017, transformers have become the go-to machine learning architecture for natural language processing (NLP) and computer vision. Although they have achieved state-of-the-art performance in these fields, the theoretical framework underlying transformers remains relatively underexplored. In the new paper A Probabilistic Interpretation of Transformers, ML Collective researcher Alexander Shim provides a probabilistic explanation of transformers' exponential dot product attention and contrastive learning based on distributions of the exponential family. An oft-proposed explanation for transformers' power and performance is their attention mechanisms' superior ability to model dependencies in long input sequences. But this doesn't directly address how and why transformer architecture choices such as exponential dot product attention outperform the alternatives.

ml collective, probabilistic interpretation, transformer, (6 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.37)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.37)




